Abstract. We introduce a new approach to LZ77 factorization that
uses O(n/d) words of working space and O(dn) time for any d > 1 (for
polylogarithmic alphabet sizes). We also describe carefully engineered
implementations of alternative approaches to lightweight LZ77
factorization. Extensive experiments show that the new algorithm is superior
in most cases, particularly at the lowest memory levels and for highly
repetitive data. As a part of the algorithm, we describe new methods for
computing matching statistics which may be of independent interest.

1 Introduction 

The Lempel-Ziv factorization [27], also known as the LZ77 factorization, or LZ77
parsing, is a fundamental tool for compressing data and string processing, and
has recently become the basis for several compressed full-text pattern matching
indexes [17,11]. These indexes are designed to efficiently store and search
massive, highly-repetitive data sets (such as web crawls, genome collections, and
versioned code repositories), which are increasingly common [21].

In traditional compression settings (for example the popular gzip tool) LZ77 
factorization is kept timely by factorizing relative to only a small, recent window 
of data, or by breaking the data up into blocks and factorizing each block sepa- 
rately. This approach fails to capture widely spaced repetitions in the input, and 
anyway, in many applications, including construction of the above mentioned 
LZ77-based text indexes, whole-string LZ77 factorizations are required. 

The fastest LZ77 algorithms (see [15,12]) use a lot of space, at least 6n bytes
for an input of n symbols and often more. This prevents them from scaling 
to really large inputs. Space-efficient algorithms are desirable even on smaller 
inputs, as they place less burden on the underlying system. 

One approach to more space efficient LZ factorization is to use compressed
suffix arrays and succinct data structures [22]. Two proposals in this direction
are due to Kreft and Navarro [18] and Ohlebusch and Gog [23]. In this paper,
we describe carefully engineered implementations of these algorithms. We also
propose a new, space-efficient variant of the recent ISA family of algorithms [15].
Most compressed index implementations are built from the uncompressed suffix
array (SA), which requires 4n bytes. Our implementations are instead based on
the Burrows-Wheeler transform (BWT), constructed directly in about 2-2.5n
bytes using the algorithm of Okanohara and Sadakane [25]. There also exist
two online algorithms based on compressed indexes [24,26], but they are not
competitive in practice in the offline context.

The main contribution of this paper is a new algorithm to compute the LZ77
factorization without ever constructing SA or BWT for the whole input. At a
high level, the algorithm divides the input up into blocks, and processes each
block in turn, by first computing a pattern matching index for the block, then
scanning the prefix of the input prior to the block through the index to compute
longest matches, which are then massaged into LZ77 factors. For a string of
length n and σ distinct symbols, the algorithm uses n log σ + O(n log n / d) bits
of space, and O(dn t_rank) time, where d is the number of blocks, and t_rank is the
time complexity of the rank operation over sequences with alphabet size σ (see
e.g. [2]). The n log σ term in the space bound is for the input string itself, which
is treated as read-only.

Our implementation of the new algorithm does not, for the most part, use 
compressed or succinct data structures. The goal is to optimize speed rather 
than space in the data structures, because we can use the parameter d to control 
the tradeoff. Our experiments demonstrate that this approach is in most cases 
superior to algorithms using compressed indexes. 

As a part of the new algorithm, we describe new techniques for computing 
matching statistics [5] that may be of independent interest. In particular, we
show how to invert matching statistics, i.e., to compute the matching statistics 
of a string B w.r.t. a string A from the matching statistics of A w.r.t. B, which 
saves a lot of space when A is much longer than B. 

All our implementations operate in main memory only and thus need at 
least n bytes just to hold the input. Reducing the memory consumption further 
requires some use of external memory, a direction largely unexplored in the 
literature so far. We speculate that the scanning, block oriented nature of the 
new algorithm will allow efficient secondary memory implementations, but that 
study is left for the future. 

2 Basic Notation and Algorithmic Machinery 

Strings. Throughout we consider a string X = X[1,n] = X[1]X[2]...X[n] of
|X| = n symbols drawn from the alphabet [0, σ - 1]. We assume X[n] is a special
"end of string" symbol, $, smaller than all other symbols in the alphabet. The
reverse of X is denoted X̂. For i = 1, ..., n we write X[i,n] to denote the suffix of
X of length n - i + 1, that is X[i,n] = X[i]X[i + 1]...X[n]. We will often refer to
suffix X[i,n] simply as "suffix i". Similarly, we write X[1,i] to denote the prefix
of X of length i. X[i,j] is the substring X[i]X[i + 1]...X[j] of X that starts at
position i and ends at position j. By X[i,j) we denote X[i,j - 1]. If j < i we
define X[i,j] to be the empty string, also denoted by ε.

Suffix Arrays. The suffix array [19] SA_X (we drop subscripts when they are clear
from the context) of a string X is an array SA[1,n] which contains a permutation
of the integers [1,n] such that X[SA[1],n] < X[SA[2],n] < ... < X[SA[n],n]. In
other words, SA[j] = i iff X[i,n] is the j-th suffix of X in ascending lexicographical
order. The inverse suffix array ISA is the inverse permutation of SA, that is
ISA[i] = j iff SA[j] = i.

Let lcp(i,j) denote the length of the longest-common-prefix of suffix i and
suffix j. For example, in the string X = zzzzzapzap, lcp(1,4) = 2 = |zz|, and
lcp(5,8) = 3 = |zap|. The longest-common-prefix (LCP) array [14,13], LCP_X =
LCP[1,n], is defined such that LCP[1] = 0, and LCP[i] = lcp(SA[i], SA[i - 1]) for
i ∈ [2,n].
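The definitions of SA, ISA and LCP can be checked on the example string with a deliberately naive sketch. It assumes 0-based positions (the text uses 1-based ones) and sorts suffixes directly, which is far slower than any real construction algorithm; the function names are illustrative only.

```python
def suffix_array(x):
    """Starting positions of the suffixes of x in ascending lexicographic order."""
    return sorted(range(len(x)), key=lambda i: x[i:])

def inverse_suffix_array(sa):
    """ISA[i] = j iff SA[j] = i."""
    isa = [0] * len(sa)
    for j, i in enumerate(sa):
        isa[i] = j
    return isa

def lcp(x, i, j):
    """Length of the longest common prefix of suffixes i and j of x."""
    k = 0
    while i + k < len(x) and j + k < len(x) and x[i + k] == x[j + k]:
        k += 1
    return k

def lcp_array(x, sa):
    """LCP[0] = 0 and LCP[i] = lcp(SA[i], SA[i - 1]) for i >= 1."""
    return [0 if i == 0 else lcp(x, sa[i], sa[i - 1]) for i in range(len(sa))]

x = "zzzzzapzap"
sa = suffix_array(x)
isa = inverse_suffix_array(sa)
lcp_x = lcp_array(x, sa)
```

With 0-based positions, `lcp(x, 0, 3)` and `lcp(x, 4, 7)` correspond to lcp(1,4) and lcp(5,8) in the text.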

For a string Y, the Y-interval in the suffix array SAx is the interval SA[s, e] 
that contains all suffixes having Y as a prefix. The Y-interval is a representation 
of the occurrences of Y in X. For a character c and a string Y, the computation 
of cY-interval from Y-interval is called a left extension and the computation of 
Y-interval from Yc-interval is called a right contraction. Left contraction and 
right extension are defined symmetrically. 

BWT and backward search. The Burrows-Wheeler Transform [3] BWT[1,n] is a
permutation of X such that BWT[i] = X[SA[i] - 1] if SA[i] > 1 and $ otherwise.
We also define LF[i] = j iff SA[j] = SA[i] - 1, except when SA[i] = 1, in which
case LF[i] = ISA[n]. Let C[c], for symbol c, be the number of symbols in X
lexicographically smaller than c. The function rank(X, c, i), for string X, symbol
c, and integer i, returns the number of occurrences of c in X[1,i]. It is well
known that LF[i] = C[BWT[i]] + rank(BWT, BWT[i], i). Furthermore, we can
compute the left extension using C and rank. If SA[s,e] is the Y-interval, then
SA[C[c] + rank(BWT, c, s - 1) + 1, C[c] + rank(BWT, c, e)] is the cY-interval. This
is called backward search.
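Backward search can be illustrated with a small sketch. It assumes 0-based inclusive intervals, a text that ends with a unique smallest symbol `$`, and a rank function simulated with `str.count` instead of a real rank data structure; all function names are hypothetical.

```python
def build_index(x):
    """Return (SA, BWT, C) for a text x that ends with a unique smallest '$'."""
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    # x[i - 1] with i == 0 wraps around to the final '$', as in the definition.
    bwt = "".join(x[i - 1] for i in sa)
    c = {ch: sum(1 for y in x if y < ch) for ch in set(x)}
    return sa, bwt, c

def left_extend(bwt, c, interval, ch):
    """Map the Y-interval (s, e) to the chY-interval, or None if it is empty."""
    s, e = interval
    if ch not in c:
        return None
    s2 = c[ch] + bwt.count(ch, 0, s)          # occurrences of ch before row s
    e2 = c[ch] + bwt.count(ch, 0, e + 1) - 1  # occurrences of ch up to row e
    return (s2, e2) if s2 <= e2 else None

def backward_search(x, pattern):
    """Sorted starting positions of pattern in x, found by left extensions."""
    sa, bwt, c = build_index(x)
    interval = (0, len(x) - 1)  # interval of the empty string
    for ch in reversed(pattern):
        interval = left_extend(bwt, c, interval, ch)
        if interval is None:
            return []
    s, e = interval
    return sorted(sa[s:e + 1])
```

For instance, `backward_search("zzzzzapzap$", "zap")` processes the pattern right to left (p, then a, then z) and ends with an interval of two suffixes.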

NSV/PSV and RMQ. For an array A, the next and previous smaller value
(NSV/PSV) operations are defined as NSV[i] = min{j ∈ [i + 1, n] | A[j] < A[i]}
and PSV[i] = max{j ∈ [1, i - 1] | A[j] < A[i]}. A related operation on A is range
minimum query: RMQ(A, i, j) is k ∈ [i, j] such that A[k] is the minimum value
in A[i,j]. Both NSV/PSV operations and RMQ operations over the LCP array
can be used for implementing right contraction (see Section 4).
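As an illustration of these definitions, the sketch below computes all NSV and PSV values with a stack in linear time and answers RMQ by a plain scan. It assumes 0-based indexing and uses n and -1 as "does not exist" markers, which are conventions chosen here rather than taken from the text.

```python
def nsv_psv(a):
    """Return (NSV, PSV) for array a; n and -1 mean that no such index exists."""
    n = len(a)
    nsv, psv = [n] * n, [-1] * n
    stack = []
    for i in range(n):                      # left to right for PSV
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()
        if stack:
            psv[i] = stack[-1]
        stack.append(i)
    stack = []
    for i in range(n - 1, -1, -1):          # right to left for NSV
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()
        if stack:
            nsv[i] = stack[-1]
        stack.append(i)
    return nsv, psv

def rmq(a, i, j):
    """Leftmost index k in [i, j] such that a[k] is the minimum of a[i..j]."""
    return min(range(i, j + 1), key=lambda k: a[k])
```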

LZ77. Before defining the LZ77 factorization, we introduce the concept of a
longest previous factor (LPF). The LPF at position i in string X is a pair
LPF_X[i] = (p_i, ℓ_i) such that p_i < i, X[p_i, p_i + ℓ_i) = X[i, i + ℓ_i), and ℓ_i is
maximized. In other words, X[i, i + ℓ_i) is the longest prefix of X[i,n] which also
occurs at some position p_i < i in X. Note that if X[i] is the leftmost occurrence of a
symbol in X then p_i does not exist. In this case we adopt the convention that
p_i = X[i] and ℓ_i = 0. When p_i does exist we call X[p_i, p_i + ℓ_i) the source for
position i. Note also that there may be more than one potential source (that is,
p_i value), and we do not care which one is used.

The LZ77 factorization (or LZ77 parsing) of a string X is then just a greedy,
left-to-right parsing of X into longest previous factors. More precisely, if the j-th
LZ factor (or phrase) in the parsing is to start at position i, then we output
(p_i, ℓ_i) (to represent the j-th phrase), and then the (j + 1)-th phrase starts at
position i + ℓ_i, unless ℓ_i = 0, in which case the next phrase starts at position
i + 1. When ℓ_i > 0, the substring X[p_i, p_i + ℓ_i) is called the source of phrase
X[i, i + ℓ_i). We denote the number of phrases in the LZ77 parsing of X by z.
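The greedy parsing can be made concrete with a brute-force sketch. It is quadratic or worse and is meant only to pin down the definition, not to reflect any of the algorithms discussed in this paper. Positions are 0-based, and a phrase of length 0 stores the symbol itself in place of p_i, following the convention above.

```python
def lz77_factorize(x):
    """Greedy LZ77 parsing of x into (p, l) pairs; (symbol, 0) for new symbols."""
    factors = []
    n = len(x)
    i = 0
    while i < n:
        best_p, best_len = -1, 0
        for p in range(i):                  # try every earlier starting position
            k = 0
            # The source may overlap the phrase itself (p + k can reach past i).
            while i + k < n and x[p + k] == x[i + k]:
                k += 1
            if k > best_len:
                best_p, best_len = p, k
        if best_len == 0:
            factors.append((x[i], 0))       # leftmost occurrence of a symbol
            i += 1
        else:
            factors.append((best_p, best_len))
            i += best_len
    return factors

def lz77_decode(factors):
    """Rebuild the string from its factors, copying symbol by symbol."""
    out = []
    for p, l in factors:
        if l == 0:
            out.append(p)
        else:
            for k in range(l):
                out.append(out[p + k])
    return "".join(out)
```

On the example string zzzzzapzap this yields z = 5 phrases: z, zzzz, a, p, zap.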

Matching Statistics. Given two strings Y and Z, the matching statistics of Y
w.r.t. Z, denoted MS_{Y|Z}, is an array of |Y| pairs, (p_1, ℓ_1), (p_2, ℓ_2), ...,
(p_{|Y|}, ℓ_{|Y|}), such that for all i ∈ [1, |Y|], Y[i, i + ℓ_i) = Z[p_i, p_i + ℓ_i) is
the longest substring starting at position i in Y that is also a substring of Z. The
observant reader will note the resemblance to the LPF array. Indeed, if we replace
LPF_Y with MS_{Y|Z} in the computation of the LZ factorization of Y, the result is
the relative LZ factorization of Y w.r.t. Z [15].

3 Lightweight, Scan-based LZ77 Parsing 

In this section we present a new algorithm for LZ77 factorization called LZscan. 

Basic Algorithm. Conceptually LZscan divides X up into d = ⌈n/b⌉ fixed size
blocks of length b: X[1, b], X[b + 1, 2b], ... . The last block could be smaller than
b, but this does not change the operation of the algorithm. In the description
that follows we will refer to the block currently under consideration as B, and
to the prefix of X that ends just before B as A. Thus, if B = X[kb + 1, (k + 1)b],
then A = X[1, kb].

To begin with we will assume no LZ factor or its source crosses a boundary
of the block B. Later we will show how to remove these assumptions.
The outline of the algorithm for processing a block B is shown below.

1. Compute MS_{A|B}

2. Compute MS_{B|A} from MS_{A|B}, SA_B and LCP_B

3. Compute LPF_{AB}[kb + 1, (k + 1)b] from MS_{B|A} and LPF_B

4. Factorize B using LPF_{AB}[kb + 1, (k + 1)b]

Step 1 is the computational bottleneck of the algorithm in theory and practice.
Theoretically, the time complexity of Step 1 is O((|A| + |B|) t_rank), where t_rank is
the time complexity of the rank operation on BWT_B (see, e.g., [2]). Thus the total
time complexity of LZscan is O(dn t_rank) using O(b) words of space in addition
to input and output. The practical implementation of Step 1 is described in
Section 4. In the rest of this section, we describe the details of the other steps.

Step 2: Inverting Matching Statistics. We want to compute MS_{B|A} but we cannot
afford the space of the large data structures on A required by standard methods.
Instead, we compute first MS_{A|B} involving large data structures on B, which we
can afford, and only a scan of A (see Section 4 for details). We then invert MS_{A|B}
to obtain MS_{B|A}. The inversion algorithm is given in Fig. 1.

Algorithm MS-Invert

 1: for i ← 1 to |B| do MS_{B|A}[i] ← (0, 0)
 2: for i ← 1 to |A| do
 3:     (p_A, ℓ_A) ← MS_{A|B}[i]
 4:     (p_B, ℓ_B) ← MS_{B|A}[p_A]
 5:     if ℓ_A > ℓ_B then MS_{B|A}[p_A] ← (i, ℓ_A)
 6: (p, ℓ) ← MS_{B|A}[SA_B[1]]
 7: for i ← 2 to |B| do
 8:     ℓ ← min(ℓ, LCP_B[i])
 9:     (p_B, ℓ_B) ← MS_{B|A}[SA_B[i]]
10:     if ℓ > ℓ_B then MS_{B|A}[SA_B[i]] ← (p, ℓ)
11:     else (p, ℓ) ← (p_B, ℓ_B)
12: (p, ℓ) ← MS_{B|A}[SA_B[|B|]]
13: for i ← |B| - 1 downto 1 do
14:     ℓ ← min(ℓ, LCP_B[i + 1])
15:     (p_B, ℓ_B) ← MS_{B|A}[SA_B[i]]
16:     if ℓ > ℓ_B then MS_{B|A}[SA_B[i]] ← (p, ℓ)
17:     else (p, ℓ) ← (p_B, ℓ_B)

Fig. 1. Inverting matching statistics

Note that the algorithm accesses each entry of MS_{A|B} only once and the order
of these accesses does not matter. Thus we can execute the code on lines 3-5
immediately after computing MS_{A|B}[i] in Step 1 and then discard that value.
This way we can avoid storing MS_{A|B}.
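Fig. 1 can be transcribed almost line by line. The following is a minimal sketch, not the authors' implementation: positions are 0-based, SA_B and LCP_B are built by naive sorting, MS_{A|B} comes from a hypothetical brute-force helper `matching_stats_naive`, and the simplifying assumption that matches do not cross the block boundary is kept.

```python
def match_len(s, i, t, j):
    """Length of the longest common prefix of s[i:] and t[j:]."""
    k = 0
    while i + k < len(s) and j + k < len(t) and s[i + k] == t[j + k]:
        k += 1
    return k

def matching_stats_naive(y, z):
    """MS_{Y|Z} by brute force: for each i, a longest match (p, l) of y[i:] in z."""
    ms = []
    for i in range(len(y)):
        best = (0, 0)
        for p in range(len(z)):
            l = match_len(y, i, z, p)
            if l > best[1]:
                best = (p, l)
        ms.append(best)
    return ms

def ms_invert(ms_ab, b):
    """Compute MS_{B|A} from MS_{A|B} using SA_B and LCP_B (Fig. 1)."""
    m = len(b)
    if m == 0:
        return []
    sa = sorted(range(m), key=lambda i: b[i:])
    lcp = [0] + [match_len(b, sa[i], b, sa[i - 1]) for i in range(1, m)]
    ms_ba = [(0, 0)] * m
    # Lines 2-5: each position of A votes for the position of B it matched.
    for i, (p_a, l_a) in enumerate(ms_ab):
        if l_a > ms_ba[p_a][1]:
            ms_ba[p_a] = (i, l_a)
    # Lines 6-11: propagate matches to lexicographically larger suffixes of B.
    p, l = ms_ba[sa[0]]
    for i in range(1, m):
        l = min(l, lcp[i])
        p_b, l_b = ms_ba[sa[i]]
        if l > l_b:
            ms_ba[sa[i]] = (p, l)
        else:
            p, l = p_b, l_b
    # Lines 12-17: the same propagation towards smaller suffixes.
    p, l = ms_ba[sa[m - 1]]
    for i in range(m - 2, -1, -1):
        l = min(l, lcp[i + 1])
        p_b, l_b = ms_ba[sa[i]]
        if l > l_b:
            ms_ba[sa[i]] = (p, l)
        else:
            p, l = p_b, l_b
    return ms_ba
```

The match lengths produced this way should agree with a direct brute-force computation of MS_{B|A}; the reported positions may differ, since any valid source is acceptable.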

Step 3: Computing LPF. Consider the pair (p, ℓ) = LPF_{AB}[i] for i ∈ [kb + 1, (k + 1)b]
that we want to compute and assume ℓ > 0 (otherwise i is the position of
the leftmost occurrence of X[i] in X, which we can easily detect). Clearly, either
p ≤ kb and LPF_{AB}[i] = MS_{B|A}[i - kb], or kb < p < i and LPF_{AB}[i] = (kb + p_B, ℓ_B),
where (p_B, ℓ_B) = LPF_B[i - kb]. Thus computing LPF_{AB} from MS_{B|A} and LPF_B
is easy.

The above is true if the sources do not cross the block boundary, but the
case where p ≤ kb but p + ℓ > kb + 1 is not handled correctly. An easy correction
is to replace MS_{A|B} with MS_{AB|B}[1, kb] in all of the steps.
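Step 3 amounts to taking, for every position of the block, the longer of two candidates. A minimal sketch, assuming 0-based positions, that both arrays are given as lists of (position, length) pairs indexed by offset within B, and that `block_start` (a name introduced here) plays the role of kb:

```python
def combine_lpf(ms_ba, lpf_b, block_start):
    """LPF_{AB} restricted to the block, from MS_{B|A} and LPF_B.

    ms_ba[j] is a source in A for offset j of B; lpf_b[j] is a source inside B,
    given relative to the start of B and therefore shifted by block_start.
    """
    lpf_ab = []
    for (p_a, l_a), (p_b, l_b) in zip(ms_ba, lpf_b):
        if l_a >= l_b:
            lpf_ab.append((p_a, l_a))                # source lies in A
        else:
            lpf_ab.append((block_start + p_b, l_b))  # source lies in B
    return lpf_ab
```

When both lengths are 0 the returned position is only a placeholder, matching the case of a leftmost symbol occurrence that the text handles separately.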

Step 4: Parsing. We use the standard LZ77 parsing to factorize B except LPF_B
is replaced with LPF_{AB}[kb + 1, (k + 1)b].

So far we have assumed that every block starts with a new phrase, or, put
another way, that a phrase ends at the end of every block. Let X[i, (k + 1)b] be the
last factor in B, after we have factorized B as described above. This may not
be a true LZ factor when considering the whole X but may continue beyond the
end of B. To find the true end point, we treat X[i,n] as a pattern, and apply the
constant extra space pattern matching algorithm of Crochemore [7], looking for
the longest prefix of X[i,n] starting in X[1, i - 1]. We must modify the algorithm
from [7] so that it matches prefixes rather than whole occurrences of the pattern,
but this is possible without increasing its time or space complexity.

4 Computation of matching statistics 

In this section, we describe how to compute the matching statistics MS_{A|B}. As
mentioned in Section 3, what we really want is MS_{AB|B}[1, kb]. However, the only
difference is that the starting point of the computation is the B-interval in SA_B
instead of the ε-interval.

Similarly to most algorithms for computing the matching statistics, we first
construct some data structures on B and then scan A. During the whole LZ
factorization, most of the time is spent on the scanning and the time for
constructing the data structures is insignificant in practice. Thus we omit the
construction details here. The space requirement of the data structures is more
important but not critical as we can compensate for increased space by
reducing the block size b. Using more space (per character of B) is worth doing if it
increases scanning speed more than it increases space. Consequently, we mostly
use plain, uncompressed arrays.

Standard approach. The standard approach of computing the matching statistics
using the suffix array is to compute for each position i the longest prefix
P_i = A[i, i + ℓ_i) of the suffix A[i, |A|] such that the P_i-interval in SA_B is
non-empty. Then MS_{A|B}[i] = (p_i, ℓ_i), where p_i is any suffix in the P_i-interval. This
can be done either with a forward scan of A, computing each P_i-interval from
P_{i-1}-interval using the extend right and contract left operations J], or with a
backward scan computing each P_i-interval from P_{i+1}-interval using the extend
left and contract right operations [23]. We use the latter alternative but with
bigger and faster data structures.

The extend left operation is implemented by backward search. We need the
array C of size σ and an implementation of the rank function on BWT. For the
latter, we use the fast rank data structure of 0, which uses 4b bytes.

The contract right operation is implemented using the NSV and PSV
operations on LCP_B as in [23], but instead of a compressed representation, we store
the NSV and PSV values as plain arrays. As a nod towards reducing space, we
store the NSV/PSV values as offsets using 2 bytes each. If the offset is too large
(which is very rare), we obtain the value using the NSV/PSV data structure of
Canovas and Navarro [4], which needs less than 0.1b bytes. Here the space saving
was worth it as it had essentially no effect on speed.

The peak memory use of the resulting algorithm is n + 24.1b + O(σ) bytes.

New approach. Our second approach is similar to the first, but instead of
maintaining both end points of the P_i-interval, we keep just one, arbitrary position s_i
within the interval. In principle, we perform left extension by backward search,
i.e., s_i = C[X[i]] + rank(BWT, X[i], s_{i+1}). However, checking whether the resulting
interval is empty and performing right contractions if it is, is more involved. To
compute s_i and ℓ_i from s_{i+1} and ℓ_{i+1}, we execute the following steps:

1. Let c = X[i]. If BWT[s_{i+1}] = c, set s_i = C[c] + rank(BWT, c, s_{i+1}) and
   ℓ_i = ℓ_{i+1} + 1.

2. Otherwise, let BWT[u] be the nearest occurrence of c in BWT before the
   position s_{i+1}. Compute the rank of that occurrence r = rank(BWT, c, u)
   and ℓ_u = LCP[RMQ(LCP, u + 1, s_{i+1})]. If ℓ_u > ℓ_{i+1}, set s_i = C[c] + r and
   ℓ_i = ℓ_{i+1} + 1.

3. Otherwise, let BWT[v] be the nearest occurrence of c in BWT after the
   position s_{i+1} and compute ℓ_v = LCP[RMQ(LCP, s_{i+1} + 1, v)]. If ℓ_v < ℓ_u, set
   s_i = C[c] + r and ℓ_i = ℓ_u + 1.

4. Otherwise, set s_i = C[c] + r + 1 and ℓ_i = min(ℓ_{i+1}, ℓ_v) + 1.

The implementation of the above algorithm is based on the arrays BWT,
LCP and R[1,b], where R[i] = rank(BWT, BWT[i], i). All the above operations
can be performed by scanning BWT and LCP starting from the position s_{i+1}
and accessing one value in R. To avoid long scans, we divide BWT and LCP into
blocks of size 2σ, and store for each block and each symbol c, the values r, ℓ_u and
ℓ_v that would get computed if scans starting inside the block continued beyond
the block boundaries.

The peak memory use is n + 27b + O(σ) bytes. This is more than in the first
approach, but the extra space is more than compensated by increased scanning speed.

Skipping repetitions. During the preceding stages of the LZ factorization, we have
built up knowledge of repetition present in A, which can be exploited to skip
(sometimes large) parts of A during the matching-statistics scan. Consider an
LZ factor A[i, i + ℓ). Because, by definition, A[i, i + ℓ) occurs earlier in A too, any
source of an LZ factor of B that is completely inside A[i, i + ℓ) could be replaced
with an equivalent source in that earlier occurrence. Thus such factors can be
skipped during the computation of MS_{A|B} without an effect on the factorization.

More precisely, if during the scan we compute MS_{A|B}[j] = (p, k) and find that
i ≤ j < j + k ≤ i + ℓ for an LZ factor A[i, i + ℓ), we will compute MS_{A|B}[i - 1] and
continue the scanning from i - 1. However, we will do this only for long phrases
with ℓ > 40. To compute MS_{A|B}[i - 1] from scratch, we use right extension
operations implemented by a binary search on SA.

To implement this "skipping trick" we use a bitvector of n bits to mark LZ77
phrase boundaries, adding 0.125n bytes to the peak memory.

5 Algorithms Based on Compressed Indexes

We went to some effort to ensure the baseline system used to evaluate LZscan
in our experiments was not a "straw man". This required careful study and
improvement of some existing approaches, which we now describe.

FM-Index. The main data structure in all the algorithms below is an
implementation of the FM-index (FMI) [9]. It consists of two main components:

- BWT_X with support for the rank operation. This enables backward search
  and the LF operation as described in Section 2. We have tried several rank
  data structures and found the one by Navarro [Ml Sect. 7.1] to be the best
  in practice.

- A sampling of SA_X. This together with the LF operation enables arbitrary
  SA access since SA[i] = SA[LF^k[i]] + k for any k < SA[i]. The sampling rate
  is a major space-time tradeoff parameter.
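The identity SA[i] = SA[LF^k[i]] + k is what makes a sampled suffix array usable. The sketch below assumes 0-based positions, a text ending with a unique smallest symbol `$`, and one particular sampling scheme (every text position divisible by `rate`), which is a choice made here for illustration and not necessarily the one used in the implementations described.

```python
def build_sampled_index(x, rate):
    """Return (LF, samples, SA); the full SA is returned only for checking."""
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    bwt = [x[i - 1] for i in sa]            # i == 0 wraps to the final '$'
    c = {ch: sum(1 for y in x if y < ch) for ch in set(x)}
    seen = {}
    lf = []
    for ch in bwt:
        # 0-based LF: C[ch] plus the number of earlier occurrences of ch in BWT.
        lf.append(c[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    samples = {row: pos for row, pos in enumerate(sa) if pos % rate == 0}
    return lf, samples, sa

def sa_access(lf, samples, i):
    """Recover SA[i] by following LF until a sampled row is reached."""
    k = 0
    while i not in samples:
        i = lf[i]        # moves to the row of the suffix starting one position earlier
        k += 1
    return samples[i] + k
```

A larger `rate` stores fewer samples but makes each access walk up to `rate - 1` LF steps, which is the space-time tradeoff mentioned above.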

In many implementations of FMI, the construction starts with computing the 
uncompressed suffix array but we cannot afford the space. Instead, we construct 
BWT directly using the algorithm of Okanohara and Sadakane [25]. The method
uses roughly 2-2.5n bytes of space but destroys the text, which is required later
during LZ parsing. Thus, once we have BWT, we build a rank structure over it 
and use it to invert the BWT. During the inversion process we recover and store 
the text and gather the SA sample values. 

CPS2 simulation. The CPS2 algorithm [6] is an LZ parsing algorithm based on
SA_X. To compute the LZ factor starting at i, it computes the X[i, i + ℓ)-interval for
ℓ = 1, 2, 3, ... as long as the X[i, i + ℓ)-interval contains a value p < i, indicating
an occurrence of X[i, i + ℓ) starting at p.

The key operations in CPS2 are right extension and checking whether an
SA interval contains a value smaller than i. Kreft and Navarro [18] as well as
Ohlebusch and Gog [23] use an FMI for X̂, the reverse of X, which allows
simulating right extension on SA_X by left extension on SA_X̂. The two algorithms
differ in the way they implement the interval checks:

- Kreft and Navarro use the RMQ operation. They use the RMQ data structure
  by Fischer and Heun [10] but we use the one by Canovas and Navarro [4].
  The latter is easy and fast to construct during BWT inversion but queries
  are slow without an explicit SA. We speed up queries by replacing a general
  RMQ with the check whether the interval contains a value smaller than i.
  This implementation is called LZ-FMI-CN.

- Ohlebusch and Gog use NSV/PSV queries. The position s of i in SA must be
  in the X[i, i + ℓ)-interval. Thus we just need to check whether either NSV[s] or
  PSV[s] is in the interval too. They as well as we implement NSV/PSV using a
  balanced parentheses representation (BPR). This representation is initialized
  by accessing the values of SA left-to-right, which makes the construction slow
  using FMI. However, NSV/PSV queries with this data structure are fast, as
  they do not require accessing SA. This implementation is called LZ-FMI-BPR.

ISA variant. Among the most space efficient prior LZ factorization algorithms
are those of the ISA family [15] that use a sampled ISA, a full SA and a rank/LF
implementation that relies on the presence of the full SA. We reduce the space
further by replacing SA and the rank/LF data structure with the FM-index
described above to obtain an algorithm called LZ-FMI-ISA.



Lightweight Lempel-Ziv Parsing



Name      σ     n/z     n/2^20  Source  Description
dna       16    14.2    100     S       Human genome
english   215   14.1    100     S       Gutenberg Project
sources   227   16.8    100     S       Linux and GCC sources
cere      5     84      100     R       yeast genome
einstein  121   2947    100     R       Wikipedia articles
kernel    160   156     100     R       Linux Kernel sources

Table 1. Data set used in the experiments. The files are 100MB prefixes of files from
the Pizza & Chili standard corpus^1 (S) and the Pizza & Chili repetitive corpus^2 (R).
The value of n/z (average length of an LZ77 phrase) is included as a measure of
repetitiveness.



6 Experiments 

We performed experiments with the files listed in Table 1. All tests were
conducted on a 2.53GHz Intel Xeon Duo CPU with 32GB main memory and 8192K
L2 Cache. The machine had no other significant CPU tasks running. The
operating system was Linux (Ubuntu 10.04) running kernel 3.0.0-26. The compiler
was g++ (gcc version 4.4.3) executed with the -O3 -static -DNDEBUG options.
Times were recorded with the C clock function. All algorithms operate strictly
in-memory.



LZscan vs. other algorithms. We compared the LZscan implementation using
our new approach for matching statistics boosted with the "skipping trick"
(Section 4) to algorithms based on compressed indexes (Section 5). The experiments
measured the time to compute LZ factorization with varying amount of available
working space. The results are shown in Figure 2. In almost all cases LZscan
outperforms the other algorithms across the whole space spectrum. Moreover, it can
operate with very small available memory (close to n bytes) unlike the other
algorithms, which all require at least 2n bytes of space to compute BWT. It achieves
superior performance for highly repetitive data even at very low memory levels.



Variants of LZscan. The second experiment measured the improvement of our
new matching statistics computation over the standard approach (see Section 4).
Additionally, each variant was tested with and without the "skipping trick",
giving 4 combinations in total. The results are plotted in Figure 3. In nearly all
cases applying any of our new techniques improves the runtime over the standard
approach, but the best effect in all cases is achieved when the techniques are
combined together. The total speedup then varies from a factor of 2 (dna) up to
12 (einstein), clearly depending on the repetitiveness of input.



1 http://pizzachili.dcc.uchile.cl/texts.html

2 http://pizzachili.dcc.uchile.cl/repcorpus.html

[Figure 2 plots not recoverable. Panel title: dna. Legend: ISA6s, LZ-FMI-ISA,
LZ-FMI-CN, LZ-FMI-BPR, LZscan. Horizontal axis: Space (bytes/char).]

Fig. 2. Time-space tradeoffs for various LZ77 factorization algorithms. The times do
not include reading from or writing to disk. For algorithms with multiple parameters
controlling time/space we show only the optimal points, that is, points forming the
lower convex hull of the points "cloud" corresponding to various settings. The vertical
line is the peak memory usage of the BWT construction algorithm [25]. For comparison,
we show the runtimes of ISA6s [15], currently the fastest LZ77 factorization algorithm
using 6n bytes.

Fig. 3. Time-space tradeoffs for different variants of the LZscan algorithm. The variants
differ in the subprocedure computing matching statistics. Each of the two approaches can
be additionally boosted by enabling the "skipping trick", yielding 4 different
combinations. See Section 4 for details. The times do not include reading from or writing to
disk.